Understanding Key File Formats in High-Throughput Research

High-throughput technologies have transformed the life sciences, producing vast quantities of data at unprecedented speed. To make sense of these data sets, you need to be familiar with the file formats they are stored in. Below, we discuss the most common file formats in high-throughput research: FASTQ, FASTA, BAM, BAI, SAM, VCF, GFF/GTF, BED, BedGraph, BigWig, and PDB.


Table of files

File Format    Link/Anchor
FASTQ          FASTQ File Format
FASTA          FASTA File Format
BAM/SAM        BAM File Format
BAI            BAI File Format
SAM            SAM File Format
VCF            VCF File Format
GFF/GTF        GFF/GTF File Format
BED            BED File Format
BedGraph       BedGraph File Format
BigWig         BigWig File Format
PDB            PDB File Format

1. FASTQ File Format

FASTQ files are widely used in bioinformatics for storing raw sequence data and corresponding quality scores. Each entry in a FASTQ file includes four lines:

  1. A sequence identifier with an optional description
  2. The raw sequence
  3. A separator line, often a single '+'
  4. Quality scores for each base in the raw sequence
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

FASTQ files are widely used in Next-Generation Sequencing (NGS) technologies such as Illumina, SOLiD, and Ion Torrent.
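The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming uncompressed input, exactly four lines per record (no wrapped sequences), and the common Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are made up for this illustration and do not come from any library.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
record = next(parse_fastq(example))
print(record.identifier, len(record.sequence), record.qualities[:4])
# prints: SEQ_ID 60 [0, 6, 6, 9]
```

For real data, which is usually gzip-compressed and very large, a streaming reader from an established library is a better choice than loading every line into memory as this sketch does.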


2. FASTA File Format

FASTA format is a simple and widely used format for representing nucleotide sequences (DNA, RNA) or protein sequences. A FASTA file starts with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning.

>SEQ_ID Description of the sequence
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

It's worth noting that the FASTA format does not contain quality scores, which is a major difference from the FASTQ format. FASTA files are frequently used in genome assemblies and gene prediction methods, as well as in sequence alignment and homology searches.
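Because a FASTA sequence may be wrapped over several lines, a reader has to join every line up to the next ">" header. The following is a minimal sketch of that idea; `parse_fasta` is a hypothetical helper name, and the sketch assumes that the first word of the description line is a unique identifier.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its full (unwrapped) sequence."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the first word after '>' identifies the sequence.
            current_id = words[0] if words else ""
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[current_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


example = [
    ">SEQ_ID Description of the sequence",
    "AGCTTTTC",
    "ATTCTGAC",
    ">SEQ_2",
    "TTGACC",
]
sequences = parse_fasta(example)
print(sequences["SEQ_ID"])  # prints: AGCTTTTCATTCTGAC
```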


3. BAM File Format

The Binary Alignment/Map (BAM) format is a binary, compressed representation of sequence alignment data. BAM files can store aligned sequences from high-throughput sequencing technologies, making them useful for representing sequence reads aligned to reference genomes. They can handle large amounts of data efficiently, which is essential in the high-throughput era.

BAM records rendered as text by a tool such as Samtools (for example, with samtools view):

seq1    99      ref     7       30      8M2I4M1D3M      =       37      39      TTAGATAAAGGATACTG   *       NM:i:1  MD:Z:8G4^C3
seq2    163     ref     9       30      3S6M1P1I4M      =       39      39      AAAAGATAAGGATA      *       NM:i:0  MD:Z:10

A BAM file includes a header section and an alignment section. The header contains information about the reference sequences and the alignment process, while the alignment section contains the alignment information for individual sequence reads.
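Since BAM is binary, you cannot inspect it with a text editor, but a quick sanity check is possible with the Python standard library alone. BAM data is compressed in gzip-compatible blocks, and the decompressed stream begins with the four magic bytes "BAM" followed by the byte 1. The sketch below relies only on that fact; `looks_like_bam` is a hypothetical helper name, and the check says nothing about whether the rest of the file is valid.

```python
import gzip


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses to data starting with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compressed, truncated, or unreadable: treat as "not BAM".
        return False
```

A call such as looks_like_bam("sample.bam"), where the path is a placeholder for one of your own files, returns True or False without reading more than the first compressed block. For anything beyond this check, use a dedicated tool or library such as Samtools.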


4. BAI File Format

A BAM Index file (BAI) accompanies a BAM file. It's a binary file that provides quick access to the alignment data for a region of the genome in the corresponding BAM file. This feature is useful when working with large data sets, as it allows researchers to access specific genomic regions without having to scan the entire file. Using a BAI file, researchers can quickly retrieve all reads aligned to a particular region, making it invaluable for tasks such as visualizing data in a genome browser or extracting data from targeted genomic regions.

Chromosome 1:    -----------------------------
BAI pointers:    ^        ^         ^

Chromosome 2:    -------------------------------
BAI pointers:    ^     ^      ^       ^

BAM file:    |--------|--------|--------|--------|--------|

In this diagram:

  • The lines labelled "Chromosome 1" and "Chromosome 2" represent two different chromosomes or reference sequences. The length of each line represents the length of the chromosome, with each "-" being a placeholder for a portion of the sequence.
  • The "^" characters under each chromosome line represent the BAI pointers. Each pointer corresponds to a specific region of the chromosome.
  • The BAI pointers "point" to blocks of data within the BAM file (represented by the boxes at the bottom). For example, all the data for the region of Chromosome 1 between the first and second pointers are stored in the first box of the BAM file.
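The pointer idea in the diagram can be imitated in a few lines of Python. This is a toy model, not the real BAI layout: an actual BAI file is binary, stores virtual file offsets, and combines a binning scheme with a linear index. The sketch only mimics the linear-index part, using 16,384-base windows as BAI does; the function names and the offsets in the example are invented for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

WINDOW = 16384  # window size in bases, matching the 16 kbp windows of the BAI linear index

# (reference name, 0-based start, exclusive end, file offset of the record)
Alignment = Tuple[str, int, int, int]


def build_linear_index(alignments: Iterable[Alignment]) -> Dict[str, Dict[int, int]]:
    """For every window, remember the smallest offset of a read overlapping it."""
    index: Dict[str, Dict[int, int]] = {}
    for ref, start, end, offset in alignments:
        windows = index.setdefault(ref, {})
        for window in range(start // WINDOW, (end - 1) // WINDOW + 1):
            windows[window] = min(offset, windows.get(window, offset))
    return index


def first_offset(index: Dict[str, Dict[int, int]], ref: str,
                 start: int, end: int) -> Optional[int]:
    """Offset at which a scan for reads overlapping [start, end) can begin."""
    windows = index.get(ref, {})
    hits = [windows[w]
            for w in range(start // WINDOW, (end - 1) // WINDOW + 1)
            if w in windows]
    return min(hits) if hits else None


reads = [
    ("chr1", 100, 250, 0),
    ("chr1", 20000, 20150, 4096),
    ("chr1", 40000, 40150, 8192),
    ("chr2", 500, 650, 12288),
]
index = build_linear_index(reads)
print(first_offset(index, "chr1", 20000, 21000))  # prints: 4096
```

The point of the structure is the same as in the diagram: a query for a region jumps straight to a nearby offset instead of scanning the file from the beginning.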

5. SAM File Format

SAM (Sequence Alignment/Map) is a tab-delimited text format designed for storing biological sequences aligned to a reference sequence. It's essentially the human-readable version of the binary BAM format. It consists of a header and an alignment section.

Each line of the header section starts with '@'. The header includes information such as the reference sequence names and lengths, the programs used for alignment, and the sequencing platform. The alignment section, on the other hand, contains information about each read and its alignment to the reference.

r001   99   ref   7   30  8M2I4M1D3M  =   37  39  TTAGATAAAGGATACTG   *   NM:i:1  MD:Z:8G4^C3

Here is what the columns represent. The first eleven are mandatory; anything after them is an optional field in TAG:TYPE:VALUE form:

  1. r001: Query template NAME.
  2. 99: bitwise FLAG.
  3. ref: Reference sequence NAME.
  4. 7: 1-based leftmost mapping POSition.
  5. 30: MAPping Quality.
  6. 8M2I4M1D3M: CIGAR string.
  7. =: Reference name of the mate/next read.
  8. 37: Position of the mate/next read.
  9. 39: observed Template LENgth.
  10. TTAGATAAAGGATACTG: segment SEQuence.
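  11. *: ASCII of Phred-scaled base QUALity+33. Here, '*' indicates that base qualities were not stored.
  12. NM:i:1: Optional tag giving the edit distance to the reference.
  13. MD:Z:8G4^C3: Optional tag encoding the mismatched and deleted reference bases.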

The SAM format can get quite complex due to the number of optional fields and the bitwise FLAG field, which can represent several attributes of the read in binary. The CIGAR string is a compact representation of the alignment of the read to the reference genome, where 'M' denotes match or mismatch, 'I' denotes insertion, and 'D' denotes deletion.
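To make the FLAG and CIGAR fields less abstract, here is a small Python sketch that decodes both. The bit values follow the SAM specification, while the short names in `FLAG_BITS` and the helper functions are invented for this illustration.

```python
import re
from typing import List, Tuple

# Bit values from the SAM specification; the labels are informal shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(decode_flag(99))
# prints: ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(reference_span("8M2I4M1D3M"), read_length("8M2I4M1D3M"))
# prints: 16 17
```

The read length of 17 matches the 17 bases of the SEQ field in the example record, which is a handy consistency check when inspecting alignments by hand.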

This entry represents just a single sequence read. A SAM file can contain millions or even billions of such entries, often accompanied by a header section with information about the sequencing run and alignment.


6. VCF File Format

VCF (Variant Call Format) is a text file format for storing sequence variants relative to a reference genome. VCF files are primarily used in bioinformatics for representing SNPs, indels, and structural variants.

A VCF file includes meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##contig=<ID=20,length=63025520,assembly=B37>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      14370   rs6054257       G       A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2
20      17330   .       T       A       3       q10     NS=3;DP=11;AF=0.017

Let's break this down:

  • The lines beginning with ## are meta-information lines providing details about the file like file format, filter details, and reference information.
  • The line beginning with #CHROM is the column header.
  • Each subsequent line represents a single variant:
    • CHROM and POS define the position of the variant on the reference genome.
    • ID is the identifier of the variant, often an rsID.
    • REF is the reference base(s).
    • ALT is the alternate base(s), i.e., the observed variant.
    • QUAL is a Phred-scaled quality score for the assertion made in ALT.
    • FILTER is PASS if the variant passed all filters; otherwise it lists the filters that failed (q10 in the second data line).
    • INFO contains additional information about the variant.

Remember, VCF files can become quite complex, especially in the INFO column and when dealing with multiple samples (additional columns beyond INFO). This snippet is a minimal example to give you the basic idea.
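The column breakdown above translates directly into code. The sketch below splits one tab-delimited data line into its eight fixed columns and unpacks the INFO field. It is only an illustration: `parse_info` and `parse_variant` are hypothetical names, INFO values are left as strings, and any FORMAT or sample columns are ignored.

```python
from typing import Any, Dict


def parse_info(info: str) -> Dict[str, Any]:
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result: Dict[str, Any] = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, equals, value = item.partition("=")
        # Keys without '=' are flags; values are kept as strings in this sketch.
        result[key] = value if equals else True
    return result


def parse_variant(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one tab-delimited VCF data line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least eight columns")
    chrom, pos, variant_id, ref, alt, qual, filters, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "passed": filters == "PASS",
        "failed_filters": [] if filters in (".", "PASS") else filters.split(";"),
        "info": parse_info(info),
    }


line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB;H2"])
variant = parse_variant(line)
print(variant["pos"], variant["alt"], variant["passed"], variant["info"]["DP"])
# prints: 14370 ['A'] True 14
```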


7. GFF/GTF File Format

GFF (General Feature Format) and GTF (Gene Transfer Format) are both tab-delimited file formats used for describing genes and other features of DNA, RNA, and protein sequences. The formats consist of one line per feature, each containing nine columns. The columns are "seqname", "source", "feature", "start", "end", "score", "strand", "frame", and "attribute". GTF is a stricter variant derived from GFF version 2: it uses the same nine columns but its own syntax for the attribute column, and it is often used for gene annotation of an assembled genome.

##gff-version 3
ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=mygene
ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=mRNA1
ctg123 . exon            1050  1500  .  +  .  ID=exon00001;Parent=mRNA00001
ctg123 . exon            3000  3902  .  +  .  ID=exon00002;Parent=mRNA00001
ctg123 . three_prime_UTR 5000  9000  .  +  .  ID=three_prime_UTR00001;Parent=mRNA00001

Let's break down the columns:

  1. ctg123: Reference sequence (e.g., chromosome, scaffold, contig).
  2. '.': Source of the feature (often a specific software or prediction method, here not specified).
  3. gene, mRNA, exon, etc: Feature type.
  4. 1000, 1050, etc: Start position of the feature.
  5. 9000, 1500, etc: End position of the feature.
  6. '.': Score (here not specified).
  7. '+': Strand (could also be - or . for not stranded).
  8. '.': Frame, called phase in GFF3 (0, 1, or 2 for coding features, or . when not applicable).
  9. Attribute field: A semicolon-separated series of tag-value pairs, providing additional information about each feature.

This GFF3 file describes a gene located on ctg123 from position 1000 to 9000 on the positive strand. The gene has an mRNA, with exons and a 3' UTR specified. Note that the example is highly simplified, and real GFF3 files can be much more complex, especially in the attributes (9th) field.
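The gene, mRNA, and exon hierarchy in this example is encoded entirely in the ID and Parent attributes, so it can be rebuilt with a short Python sketch. The helper names are hypothetical, and the sketch assumes true tab-separated input (the example above is shown with spaces for readability) and GFF3-style tag=value attributes.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited GFF3 feature line into its nine columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line needs exactly nine columns")
    seqid, source, kind, start, end, score, strand, phase, raw = fields
    attributes = {}
    for pair in raw.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 escapes reserved characters with URL-style percent encoding.
            attributes[unquote(key)] = unquote(value)
    return {"seqid": seqid, "source": source, "type": kind,
            "start": int(start), "end": int(end), "score": score,
            "strand": strand, "phase": phase, "attributes": attributes}


def children_by_parent(features: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group feature IDs under the ID(s) named in their Parent attribute."""
    tree = defaultdict(list)
    for feature in features:
        parents = feature["attributes"].get("Parent", "")
        for parent in parents.split(","):
            if parent:
                tree[parent].append(feature["attributes"].get("ID"))
    return dict(tree)


rows = [
    ["ctg123", ".", "gene", "1000", "9000", ".", "+", ".", "ID=gene00001;Name=mygene"],
    ["ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".", "ID=mRNA00001;Parent=gene00001;Name=mRNA1"],
    ["ctg123", ".", "exon", "1050", "1500", ".", "+", ".", "ID=exon00001;Parent=mRNA00001"],
    ["ctg123", ".", "exon", "3000", "3902", ".", "+", ".", "ID=exon00002;Parent=mRNA00001"],
]
features = [parse_gff3_line("\t".join(row)) for row in rows]
print(children_by_parent(features))
# prints: {'gene00001': ['mRNA00001'], 'mRNA00001': ['exon00001', 'exon00002']}
```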

8. BED File Format


The BED (Browser Extensible Data) format is a flexible, column-based format for defining data lines that are displayed in an annotation track. BED files are used in a variety of tasks, such as finding significant overlap between large datasets, and visualizing data in genome browsers.

BED files have three required fields - chromosome, start position, and end position, and nine additional optional fields. The optional fields allow for detailed information about the feature, such as its name, score, strand, etc.

chr1    1300    9000    feature1    0    +
chr1    1350    2000    feature2    0    -
chr2    3000    3902    feature3    0    +
chr2    5000    6000    feature4    0    -

Here's what the columns represent:

  1. 'chr1', 'chr2': The name of the chromosome or scaffold. The prefix "chr" is optional and the values can change depending on the genome.
  2. '1300', '1350', etc: The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
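  3. '9000', '2000', etc: The ending position of the feature in the chromosome. BED coordinates are zero-based and half-open: the start is counted from 0 and the end is exclusive, so the feature covers the bases from the start up to but not including the end. For example, feature1 (start 1300, end 9000) spans 9000 - 1300 = 7700 bases, which correspond to positions 1301 through 9000 in 1-based, inclusive coordinates.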
  4. 'feature1', 'feature2', etc: The name of the BED line.
  5. '0': The score of the feature. Here, we have used 0 for all features as a placeholder.
  6. '+', '-': The strand of the feature. Either "+" or "-", or "." when the feature has no strand.

In its simplest form, a BED file requires only the first three fields. Additional fields can be added for more complex data. In addition, the use of track and browser lines can provide further customization for display in genome browsers.
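The zero-based, half-open convention is a common source of off-by-one errors, so it helps to see the arithmetic written out. The following Python sketch uses made-up helper names and the first two features of the example above.

```python
from typing import Tuple


def bed_length(start: int, end: int) -> int:
    """Number of bases in a half-open BED interval."""
    return end - start


def bed_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a_start < b_end and b_start < a_end


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert to 1-based, inclusive coordinates (as used by GFF or VCF)."""
    return start + 1, end


print(bed_length(1300, 9000))               # prints: 7700
print(bed_overlap(1300, 9000, 1350, 2000))  # prints: True
print(bed_overlap(1000, 2000, 2000, 3000))  # prints: False
print(bed_to_one_based(1300, 9000))         # prints: (1301, 9000)
```

Note that the two intervals 1000 to 2000 and 2000 to 3000 touch but do not overlap, because the end coordinate is exclusive.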


9. BedGraph File Format

BedGraph files are designed to represent continuous data along the genome, such as signal intensities or coverage levels. They provide a way to visualize and store numerical values at each position in the genome. A BedGraph file typically consists of four columns: chromosome, start position, end position, and value. The values in BedGraph files can be positive, negative, or zero, representing features like read coverage, gene expression levels, or ChIP-seq signal intensities.

track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr1    1000    2000    0.25
chr1    2000    3000    0.50
chr1    3000    4000    0.75
chr2    1000    2000    1.00
chr2    2000    3000    0.85

Here's what the columns represent:

  1. 'chr1', 'chr2': The name of the chromosome or scaffold.
  2. '1000', '2000', etc: The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
  3. '2000', '3000', etc: The ending position of the feature in the chromosome. Like BED, BedGraph coordinates are 0-based and half-open.
  4. '0.25', '0.50', etc: A single data value associated with the BED interval.

The first line of the BedGraph example (beginning with the word track) is the track definition line, which provides configuration settings for displaying the track in a genome browser. It includes details like track type, name, description, visibility, color, and priority.
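As a small worked example of using the four BedGraph columns, the Python sketch below computes the length-weighted mean signal per chromosome while skipping track and browser lines. The function name `mean_signal` is hypothetical, and the sketch assumes whitespace-separated columns and non-overlapping intervals.

```python
from collections import defaultdict
from typing import Dict, Iterable


def mean_signal(lines: Iterable[str]) -> Dict[str, float]:
    """Length-weighted mean value per chromosome over the covered bases."""
    weighted = defaultdict(float)
    covered = defaultdict(int)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Skip display settings and comments; only data lines carry values.
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end, value = fields[:4]
        length = int(end) - int(start)  # half-open interval, so no +1
        weighted[chrom] += length * float(value)
        covered[chrom] += length
    return {chrom: weighted[chrom] / covered[chrom]
            for chrom in covered if covered[chrom] > 0}


example = [
    'track type=bedGraph name="BedGraph Format" visibility=full',
    "chr1    1000    2000    0.25",
    "chr1    2000    3000    0.50",
    "chr1    3000    4000    0.75",
    "chr2    1000    2000    1.00",
    "chr2    2000    3000    0.85",
]
means = mean_signal(example)
print(means["chr1"])  # prints: 0.5
```

For chr2 the same calculation gives approximately 0.925, the average of 1.00 and 0.85 over two intervals of equal length.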


10. BigWig File Format

BigWig files are a binary file format commonly used in bioinformatics for efficient storage and retrieval of large-scale genomic data, such as signal intensities, coverage tracks, or other quantitative measurements. BigWig files are primarily designed for visualization and analysis in genome browsers and other genome data analysis tools.

BigWig files are compressed and indexed, allowing for fast random access to specific genomic regions. They provide an efficient representation of continuous numerical data across the genome and allow for zooming in and out of different genomic scales without the need to load the entire dataset into memory. BigWig files are compatible with popular genome browsers like UCSC Genome Browser and Integrative Genomics Viewer (IGV).


11. PDB File Format

The Protein Data Bank (PDB) format is used to store three-dimensional data of proteins and nucleic acids. This format is widely used in the fields of molecular modeling, structural bioinformatics, protein design, drug discovery, and more. A PDB file consists of several sections providing different types of data, including information about atoms, connectivity, sequences, and crystallographic structure.

HEADER    ALANINE                                    10-MAY-23
TITLE     EXAMPLE STRUCTURE OF ALANINE
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ALANINE;
COMPND   3 CHAIN: A;
COMPND   4 ENGINEERED: YES;
ATOM      1  N   ALA A   1      -0.677   0.000   0.000  1.00 20.00           N
ATOM      2  CA  ALA A   1       0.603   0.000   0.000  1.00 20.00           C
ATOM      3  C   ALA A   1       1.273   1.212   0.000  1.00 20.00           C
ATOM      4  O   ALA A   1       0.603   2.212   0.000  1.00 20.00           O
ATOM      5  CB  ALA A   1       1.273  -0.788   1.212  1.00 20.00           C
ATOM      6  H   ALA A   1      -1.193  -0.788  -0.515  1.00 20.00           H
ATOM      7  HA  ALA A   1       0.603  -0.788  -0.515  1.00 20.00           H
ATOM      8  HB1 ALA A   1       0.603  -0.788   2.212  1.00 20.00           H
ATOM      9  HB2 ALA A   1       1.943  -1.576   1.212  1.00 20.00           H
ATOM     10  HB3 ALA A   1       1.943   0.000   1.727  1.00 20.00           H
END

Let's break this down:

  • HEADER, TITLE, and COMPND lines provide general information about the molecule.
  • ATOM lines contain atomic coordinates and other information for each atom in the molecule. In particular:
    • The number after ATOM is the atom serial number.
    • N, CA, C, etc., are atom names.
    • ALA is the three-letter code for the amino acid (alanine in this case).
    • A is the chain identifier.
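    • The number after the chain identifier (1 in this example) is the residue sequence number.
    • The following three numbers represent the x, y, and z coordinates of each atom, in ångströms.
    • The next two numbers are occupancy and temperature factor, respectively. In this example, they are just placeholder values.
    • The final column is the element symbol of the atom.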
  • END marks the end of the file.

Please note that PDB files can become quite complex when dealing with large proteins, and may also contain additional information such as secondary structure, connectivity, and crystallographic data. This snippet is a minimal example to give you the basic idea.
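Unlike the tab-delimited formats above, PDB is a fixed-width format: each field sits at a fixed column range rather than being separated by a delimiter, so splitting on whitespace can fail when fields run together. The Python sketch below slices one ATOM line by column position. The `Atom` type and `parse_atom_line` are hypothetical names for this illustration, and the column ranges are those of the standard ATOM record layout.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Slice a fixed-width ATOM/HETATM record by column position."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),            # columns 7-11
        name=line[12:16].strip(),          # columns 13-16
        residue=line[17:20].strip(),       # columns 18-20
        chain=line[21],                    # column 22
        residue_number=int(line[22:26]),   # columns 23-26
        x=float(line[30:38]),              # columns 31-38
        y=float(line[38:46]),              # columns 39-46
        z=float(line[46:54]),              # columns 47-54
        occupancy=float(line[54:60]),      # columns 55-60
        b_factor=float(line[60:66]),       # columns 61-66
        element=line[76:78].strip(),       # columns 77-78
    )


line = "ATOM      1  N   ALA A   1      -0.677   0.000   0.000  1.00 20.00           N"
atom = parse_atom_line(line)
print(atom.name, atom.residue, atom.chain, atom.x, atom.b_factor, atom.element)
# prints: N ALA A -0.677 20.0 N
```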

Conclusion

Understanding these file formats is a fundamental part of working in high-throughput research. Each file type has unique attributes and is used in different contexts depending on the type of analysis you are performing. Familiarity with these formats enables efficient handling, processing, and analysis of high-throughput data, paving the way for insightful biological discoveries. Remember, high-quality data analysis begins with understanding your data at its most basic level – its format. So next time you encounter a FASTQ, FASTA, BAM, or BAI file, you’ll know exactly what it contains and how best to use it in your research.